SelectKBest Feature Selection || POC

Objective

  1. TAO engineers spend extra effort tuning a correlation-based model for relevant features.
    • Reason: in many cases, relevant tag(s) with low correlation scores do not rank high in the feature list.
  2. SelectKBest Feature Selection:
    • A methodology to identify the TAGS logically related to the target tag.
    • Selects the best features using a statistical method that checks for significance along with the linear relationship among the tags.

Dataset Selection

  1. All datasets extracted and used for these experiments were based on CV models in QA and PROD.
  2. CV models on the platform have a pre-existing trained model with suitable WB features.
    • This was used as the starting point of this exercise. The algorithm will be tasked with proposing a sub-set of features that improves the overall quality of the model (based on KPIs listed below).
  3. The PROD environment has a large number of CV models. When identifying suitable candidates for experimentation, tags were selected based on the following criteria:
    • The tag should have WhiteBOX (WB) features or best-correlated features used for modeling.

KPIs to look out for:

  1. The proposed features should overlap with the WhiteBOX features to the greatest extent possible.
    • Additional proposed features should improve the model's predictive power.
  2. We want to compare the insample and outsample:
    • Mean Absolute Error (MAE)
    • Root Mean Squared Error (RMSE)
    • R2 Score
  3. An increase in R2 score and a decrease in MAE/RMSE is the key pattern to look out for after the methodologies have been applied; a minimal sketch of the KPI computation follows below.
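
A minimal sketch of the KPI computation, assuming the insample/outsample splits and model predictions already exist (the variable names here are illustrative only):

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    def report_kpis(y_true, y_pred, label):
        # The three KPIs listed above: MAE, RMSE (root of MSE) and R2
        mae = mean_absolute_error(y_true, y_pred)
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        r2 = r2_score(y_true, y_pred)
        print('{} -> MAE: {:.3f}, RMSE: {:.3f}, R2: {:.3f}'.format(label, mae, rmse, r2))

    # report_kpis(y_train, model.predict(X_train), 'Insample')
    # report_kpis(y_test, model.predict(X_test), 'Outsample')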

Limitations/Caveats

  1. The initial POC/experiment was performed on a set of the top 100 correlated features, comprising the WhiteBOX features.
  2. There were concerns about doing this at scale.
  3. We tried this mini-batch approach of 100 features and found the scores were consistent with the full batch.
    • At first it was unclear why this is the case; it is explained logically below.
  4. The approach does not work well when the distributions are not Gaussian; indeed, the more Gaussian the distributions are, the better this approach works.
    • Does ATPS + SelectKBest sound good?

Methodology Proposed

SelectKBest

SelectKBest identifies the most highly contributing features according to the "k" highest scores.

  • f_regression scores
  • It gives us the ability to run correlation and the F-test simultaneously for the provided pool of features against the associated target tag; a minimal usage sketch follows below.
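
A minimal sketch of this usage with scikit-learn (synthetic data; the tag names are made up purely for illustration):

    import pandas as pd
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest, f_regression

    # Synthetic stand-in for a pool of correlated feature tags
    X, y = make_regression(n_samples=500, n_features=20, n_informative=5, random_state=0)
    feature_df = pd.DataFrame(X, columns=['TAG_{:02d}.PV'.format(i) for i in range(20)])

    # Score every feature against the target and keep the k best
    selector = SelectKBest(score_func=f_regression, k=5)
    selector.fit(feature_df, y)
    print('Selected tags:', list(feature_df.columns[selector.get_support()]))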

A brief architecture is explained in the chart below.

SelectKBest Architecture

SelectKBest Working Flowchart:

SelectKBest Working Flowchart

Challenges with Pearson Correlation

  1. Pearson correlation measures a linear relationship and can be highly sensitive to outliers.
  2. It cannot distinguish between independent and dependent variables. Therefore, even if a relationship between two variables is found, Pearson's r does not indicate which variable was 'the cause' and which was 'the effect'.
  3. It assumes that there is always a linear relationship between the variables, which might not be the case at all times.
  4. It can be easily misinterpreted: a large value of the correlation coefficient does not necessarily mean a very high linear relationship between the two variables.
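
Two of these caveats (outlier sensitivity and missed nonlinear relationships) are easy to demonstrate; a small illustrative sketch on synthetic data:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, 200)

    # Perfect quadratic dependence, yet Pearson's r is close to zero
    print('quadratic r: %.3f' % pearsonr(x, x ** 2)[0])

    # Independent noise plus one extreme outlier inflates r towards 1
    y = rng.normal(size=200)
    x_out, y_out = np.append(x, 100.0), np.append(y, 100.0)
    print('with outlier r: %.3f' % pearsonr(x_out, y_out)[0])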

F Regression Scoring Function

  • Univariate linear regression tests returning F-statistic and p-values.

  • Quick linear model for testing the effect of a single regressor, non-sequentially for many regressors.

  • This is done in 2 steps:

    • The cross-correlation between each regressor and the target is computed, that is, E[(X[:, i] - mean(X[:, i])) * (y - mean_y)] / (std(X[:, i]) * std(y)), using the r_regression function.

    • It is converted to an F score and then to a p-value.

  • f_regression is derived from r_regression and will rank features in the same order if all the features are positively correlated with the target.

Note, however, that contrary to f_regression, r_regression (which represents Pearson correlation) returns values in [-1, 1] that can thus be negative. f_regression is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.
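
The two steps above can be reproduced by hand and checked against sklearn; a hedged sketch on synthetic data (assuming scikit-learn >= 1.0, where r_regression is available):

    import numpy as np
    from scipy.stats import f as f_dist
    from sklearn.feature_selection import f_regression, r_regression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 4))
    y = 2.0 * X[:, 0] + rng.normal(size=300)

    n = X.shape[0]
    r = r_regression(X, y)                       # step 1: cross-correlation per feature
    f_manual = r ** 2 / (1 - r ** 2) * (n - 2)   # step 2: convert r to an F score
    p_manual = f_dist.sf(f_manual, 1, n - 2)     # ...and the F score to a p-value

    f_sklearn, p_sklearn = f_regression(X, y)
    print(np.allclose(f_manual, f_sklearn), np.allclose(p_manual, p_sklearn))  # True True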

  • Intuition:

    F1 better discriminator than F2

    • Ref: https://datascience.stackexchange.com/questions/74465/how-to-understand-anova-f-for-feature-selection-in-python-sklearn-selectkbest-w

    • In the above reference, the kind of F-test being performed is the ANOVA test, where we determine the variance of the features and how well each feature discriminates between two classes.

    • What makes F1 better than F2?

      • The distance between the means of the class distributions on F1 is greater than on F2. (distance_between_classes)
      • The variance of each single class on F1 is less than on F2. (compactness_of_classes)
      • Now we can easily say distance_between_classes / compactness_of_classes is a good score! The higher this score, the better the feature discriminates between classes.
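
A toy illustration of this intuition with sklearn's f_classif (synthetic data; F1 has well-separated, tight class distributions while F2's overlap):

    import numpy as np
    from sklearn.feature_selection import f_classif

    rng = np.random.default_rng(2)
    labels = np.repeat([0, 1], 200)
    # F1: distant class means, small within-class variance
    f1 = np.concatenate([rng.normal(0, 0.5, 200), rng.normal(3, 0.5, 200)])
    # F2: close class means, large within-class variance
    f2 = np.concatenate([rng.normal(0, 2.0, 200), rng.normal(0.5, 2.0, 200)])

    scores, _ = f_classif(np.column_stack([f1, f2]), labels)
    print('F1 score: %.1f, F2 score: %.1f' % tuple(scores))  # F1 scores far higher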

Risk(s) Involved with F-Regression

Assumption of Normality

image

  1. The F-test technique involved uses the ANOVA (Analysis of Variance) methodology to test the significance of the features.

    • Assumptions of ANOVA:
      1. The experimental errors of your data are normally distributed.
      2. Equal variances between treatments.
        • Homogeneity of variances
        • Homoscedasticity
      3. Independence of samples
        • Each sample is randomly selected and independent.
  2. F-regression may produce unreliable scores if the distribution of a feature tag/sensor is non-normal, but:

    • For large N (sample size):
      • The assumption of normality can be relaxed.
      • ANOVA is not really compromised if the data is non-normal.
  3. The assumption of normality is important when:

    • N (the sample size) is very small.
    • The data is highly non-normal.
    • The effect size is small.
  4. A simple chart explaining the same:

    image

End Note: With a very large sample size, the normality assumption in ANOVA can be relaxed for non-normal distributions, and the F-test technique is considered robust on these distributions.
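
A quick hedged check of this claim on synthetic data: even with heavily skewed, non-Gaussian features, f_regression still ranks the truly related feature far above noise when N is large:

    import numpy as np
    from sklearn.feature_selection import f_regression

    rng = np.random.default_rng(3)
    informative = rng.exponential(scale=1.0, size=20000)  # skewed, non-normal
    noise = rng.exponential(scale=1.0, size=20000)
    y = 0.8 * informative + rng.normal(size=20000)

    scores, _ = f_regression(np.column_stack([informative, noise]), y)
    print('informative F: %.0f, noise F: %.1f' % tuple(scores))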

Source:

  1. To handle edge cases:

       # Required imports for this snippet
       import pandas as pd
       from scipy import stats
       from scipy.stats import kstest
       from sklearn.preprocessing import StandardScaler

       # Scaling
       scaler = StandardScaler()
       scaled_features = scaler.fit_transform(feature_df)
       scaled_features_df = pd.DataFrame(scaled_features, index=feature_df.index, columns=feature_df.columns)

       # Checking whether each feature tag is normally distributed, as per the ANOVA assumption
       for col in scaled_features_df.columns:
           # Kolmogorov-Smirnov test against the standard normal distribution
           _, p_val = kstest(scaled_features_df[col], "norm")
           # A p-value below 0.05 rejects normality, so the feature needs transforming
           if p_val < 0.05:  # non-normal
               # Box-Cox requires strictly positive input, so shift the
               # standard-scaled values above zero before transforming
               shifted = scaled_features_df[col] - scaled_features_df[col].min() + 1.0
               fitted_data, fitted_lambda = stats.boxcox(shifted)
               scaled_features_df[col] = fitted_data
    

Working Snippet of Code

from datetime import datetime
import pandas as pd
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest, f_regression

def univariate_methodology(feature_df, target_df, max_features_user_wants, scoring_function):
    
    '''
    Input:
        - feature_df: dataframe containing the feature data
        - target_df: dataframe containing the target data
        - max_features_user_wants: maximum number of features to be shortlisted from the CORRELATED pool of features
        - scoring_function: scoring function passed to SelectKBest (e.g. f_regression)
        
    Execution:
        - SelectKBest requires the dataset as NumPy arrays.
        - We then specify how many TOP features to choose and which score_function to use.
        - We then return the final selected feature names, which we can use to create a new model.
        
    Return:
        - Feature List Names 
        - Target Name
        - A Dataframe comprising Correlation Scores and P-Values for each feature
    '''
    start_time = datetime.now()
    features = feature_df.values
    target = target_df.values.ravel()
    # target = target.astype('int')
    # print('Feature shape:',features.shape)
    # print('Target shape:',target.shape)
    
    # feature extraction
    print('User has asked for max {} features'.format(max_features_user_wants))
    test = SelectKBest(score_func = scoring_function, k=max_features_user_wants)
    '''
    score_func: Function taking two arrays X and y, and returning a pair of arrays 
                (scores, pvalues) or a single array with scores. 
    
    f_regression: F-value between label/feature for regression tasks.
    The goal of the F-test is to provide a significance level. If you want to make sure the features you are
    including are significant with respect to your p-value, you use an F-test.
    If you just want to include the k best features, you can use the correlation only.
    Ref: 
        - https://stats.stackexchange.com/questions/204141/difference-between-selecting-features-based-on-f-regression-and-based-on-r2
        - https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression
        
    Alternatively we can use:
        - mutual_info_regression: Mutual information for a continuous target.
        Estimate mutual information for a continuous target variable.
        Mutual information (MI) between two random variables is a non-negative value, which measures the 
        dependency between the variables. It is equal to zero if and only if two random variables are independent,
        and higher values mean higher dependency.
        
        The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors 
        distances.
    
    ref: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
    '''
    
    fit = test.fit(features,target)
    feature_scores = fit.scores_[fit.get_support()]
    print('Completed SelectKBest function')

    mask = test.get_support()
    new_features = feature_df.columns[mask]
    prep_df = pd.DataFrame()
    prep_df['Columns'] = feature_df.columns
    prep_df['{}_scores'.format(scoring_function.__name__)] = fit.scores_
    prep_df['P_values'] = fit.pvalues_
    out_list = []
    for column in feature_df.columns:
        corr_tuple = pearsonr(feature_df[column], target)
        out_list.append(corr_tuple[0])
    
    prep_df['Correlation_scores'] = out_list
    # print(fit.scores_)
    print('Completed selecting new features.')
    end_time = datetime.now()
    print('Total time taken for SelectKBest execution: {}'.format(end_time - start_time))
    return new_features,target,prep_df
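
A hedged usage sketch of the snippet above (feature_df/target_df stand in for the extracted tag data; the names are illustrative):

    # Shortlist the 10 best tags from a correlated pool using f_regression
    selected_tags, target, scores_df = univariate_methodology(
        feature_df, target_df, max_features_user_wants=10, scoring_function=f_regression
    )
    # scores_df carries f_regression scores, p-values and Pearson correlations per tag
    print(scores_df.sort_values('f_regression_scores', ascending=False).head(10))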

F_regression Verification/Result Validation

Executing as a single batch vs mini-batches:

  • Target Tag: 98P503.MV
    • Tag in PRD Environment
    • Results are documented in the slide pack also.
    • Correlated Features: image
    • SelectKBest Features:
      • 98F501.PV
      • 98F503.PV
      • 98F523.PV
      • 98F520.PV
      • 98F521.PV
      • 98F525.PV
      • 98P508.PV
      • 98P716.PV
      • 98T685.PV
      • 98T682.PV
    • WB Features:
      • 98F501.PV
      • 98F503.PV
      • 98F524.PV
      • 98P503.PV
      • 98F520.PV
      • 98F521.PV
    • LIVE Model KPI (LIVE Model C3 UI):
      • Insample Metrics:
        • MAE: 0.767
        • RMSE: 1.051
        • R2: 0.862
      • Outsample Metrics:
        • MAE: 1.141
        • RMSE: 1.490
        • R2: 0.507
    • SelectKBest Model KPI (SelectKBest Model C3 UI):
      • Insample Metrics:
        • MAE: 0.776
        • RMSE: 1.083
        • R2: 0.866
      • Outsample Metrics:
        • MAE: 1.620
        • RMSE: 2.062
        • R2: 0.598
  • Target Tag: CD6:602TC457.PV
    • Tag in QA Environment
    • Results are documented in the slide pack also.
    • Correlated Features: image
    • SelectKBest Features:
      • CD6:602TT457.PV
      • CD6:602TC456.PV
      • CD6:602TT456.PV
      • CD6:602TC451.PV
      • CD6:602TT451.PV
      • CD6:602TC455.PV
      • CD6:602TT455.PV
      • CD6:602TC453.PV
      • CD6:602TT453.PV
      • CD6:602TC452.PV
    • WB Features:
      • CD6:602TC452.PV
      • CD6:602TT125.PV
      • CD6:602TC453.PV
      • CD6:602TC451.PV
      • CD6:602TC450.PV
      • CD6:602TC454.PV
      • CD6:602TC455.PV
      • CD6:602FC360.PV
      • CD6:602TT445.PV
      • CD6:602FC367.PV
    • LIVE Model KPI (LIVE Model C3 UI):
      • Insample Metrics:
        • MAE: 0.900
        • RMSE: 6.254
        • R2: 0.931
      • Outsample Metrics:
        • MAE: 0.659
        • RMSE: 2.656
        • R2: 0.035
    • SelectKBest Model KPI (SelectKBest Model C3 UI):
      • Insample Metrics:
        • MAE: 0.629
        • RMSE: 2.702
        • R2: 0.987
      • Outsample Metrics:
        • MAE: 0.900
        • RMSE: 3.085
        • R2: -0.303

For more results, please refer:

  • PPT: Feature Selection Deck
  • The other TAGS included as part of the POC experiments to prove the concept:
    • CD6:602PC037.PV CD6:602TT108.PV MLO:021F064.MV MEOD:093P007.OP P4049_ID2_OP 98P503.MV 114FICA034.MV MLO:021F052.MV LS41243_AO4_OP HV8:008PC113.OP HDS5:86PT122.PV CD6:602TC457.PV CD6:608FC005.OP

Why do the results of mini-batch vs full-batch execution of SelectKBest match?

  • Since SelectKBest uses the f_regression scoring function, which uses the F-test/ANOVA test to identify the significance of the relationship between each feature and the target, the combined process is a univariate feature selection process: each feature is scored independently of all others.
  • For each feature, the above-mentioned distance_between_classes / compactness_of_classes style score is calculated, and the respective F-score is obtained from the F-table.
  • We performed a small experiment on a piece of equipment, where we chose:
    • Set of first 10 features, and ran SelectKBest to identify the scores.

      • first_10_features_df
    • Set of next 10 features, and ran SelectKBest to identify the scores.

      • next_10_features_df
    • Set of final 100 features, and ran SelectKBest to identify the scores.

      • set_of_100_features_df
    • And now, if we look for features common to all 3 sub-experiments, we find:

      • intersection_result
    • This confirms that the results of mini-batch execution of f_regression match the full-batch execution results; a minimal check follows below.
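
Because each feature is scored independently of every other, splitting the pool into batches cannot change any individual score. A minimal check on synthetic data (illustrative only):

    import numpy as np
    from sklearn.feature_selection import f_regression

    rng = np.random.default_rng(4)
    X = rng.normal(size=(1000, 100))
    y = X[:, :5].sum(axis=1) + rng.normal(size=1000)

    # Scores computed in mini-batches of 10 equal the full-batch scores
    full_scores, _ = f_regression(X, y)
    batch_scores = np.concatenate(
        [f_regression(X[:, i:i + 10], y)[0] for i in range(0, 100, 10)]
    )
    print(np.allclose(full_scores, batch_scores))  # True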


Comparing the two methodologies

|  | SelectKBest Methodology | Pearson Correlation Methodology |
| --- | --- | --- |
| Computationally efficient? | Equivalent | Equivalent |
| Better performance? | Equivalent | Equivalent |
| Effective in reducing training time? | Equivalent | Equivalent |
| More granular resulting methodology? | ✅ It is more statistical and free from any bias. | Depends only on the linear behaviour, not on statistical behaviour. |

Methodology in review:


Final Results

TAG: MLO:021F064.MV

  • Moerdijk
    • Plant id: 0635EC1E-533F-4FE0-8C46-295F41B28F48
  • Run SelectKBest on the whole plant
  • Document top 10 features as per f_regression score, and create a model using the same
    image

TAG: HV8:008PC113.OP

  • Pernis
    • Plant id: 3A158E3B-AC18-46FA-9720-04917B461035
  • Run SelectKBest on the whole plant.
  • Document top 10 features as per f_regression score, and create a model using the same.
    image

TAG: CVP:77FC5005.OUT

  • Pernis
    • Plant id: BCB4D346-0713-4E53-8BAF-1A8FF6DBBCA7
    • Tag outside of POC
      image
    • WB Features in SelectKBest features?
      image
    • WB Features in Correlation features?
      • None of the WB Features appeared in Correlated features.
        image

TAG: C4TR:040PC431.OP

  • Pernis
    • Plant id: 20BDA776-6197-4CCD-BF37-70149E597DF8
    • Tag outside of POC
      image
    • WB Features in SelectKBest features?
      image
    • WB Features in Correlation features?
      image

TAO Feedback

  • The SelectKBest proposed features are mostly from the same drawing (PID diagram).
  • Some of the selected features could have been included earlier if Correlation had selected them.

Final Remarks

  • Results are broadly similar to, or slightly better than, the Correlation methodology.
  • As per multiple conversations with @toluogunseyeatshell:
    • We should put it in QA and check how much it improves, as the performance issue with respect to correlation is taken care of.
    • Test it out on a larger number of TAGS and keep TAO in the loop regarding the features proposed by SelectKBest.

Testing Phase Remarks:

Testing Workbook: https://my.shell.com/:x:/g/personal/disha_soni_shell_com/EVW9ZCKnNNtLhwbc4MuntdYBBvzM5gPhH1p0_0zetYY_9g?email=Atul.Mishra%40shell.com&e=tkOENo

  • SelectKBest := 'f_regression' performs about the same as, or slightly better than, Correlation, but not significantly enough to conclude it is a breakthrough methodology.
    • The technical reason is that it is a univariate approach which does involve correlation, which is why similar performance and features are observed.
    • Also, every time the methodology consumes more than 11k features, correlation tends to be the driving factor for feature selection.
  • Most of the WB models tested had features belonging to the same UNIT, chosen irrespective of their correlation value.
  • A large number of non-GREEN models are also LIVE, the reason being:
    • The predicted difference offset is very low, meaning the prediction trend matches the OP signal most of the time; hence the model was made LIVE.
    • This gives us the idea to work on the AutomatedLIVE model scenario improvement.

Testing Result Summary

image

image

SelectKBest Feature Selection Testing Status:

  • Testing involved 110+ tags.
  • Testing results are summarised in the below workbook:
  • SelectKBest Wiki Page with logic + POC + implementation + remarks on testing with PROD data:
  • PPT used while proposing the POC:
  • Final comments observed after testing:
    1. Although SelectKBest theoretically works better than Correlation, it underperforms due to the high number of tags considered while running the methodology.
    2. During the POC, 100 tags were selected because the UI provides only that many, and downloading minute-level data for thousands of tags is not feasible on a local system.
    3. On average, 10k+ features are selected for a Correlation/SKB job, since the hierarchy mapping happens at the plant level and not at the unit level.
    4. This is the same challenge that the Correlation logic currently faces as well.
    5. The objective of proposing an improved set of TAGS using SelectKBest mostly returns the Pearson Correlation features.
    6. Evidence that most of the LIVE models consume features belonging to the UNIT LEVEL:
    7. We should look for a way to decrease the number of feature TAGS considered for Correlation/SKB execution; a hypothetical sketch follows this list.
      a. Possible via C3 mapping.
      b. Generalizing the TAG nomenclature.
      c. This can help in identifying UNIT LEVEL TAGS.
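
A hypothetical sketch of such a nomenclature-based pre-filter (the unit-prefix convention below is assumed purely for illustration and would need to match the real TAG nomenclature):

    # Hypothetical pre-filter: keep only candidate tags whose unit prefix
    # matches the target tag's unit (the prefix rule below is an assumption)
    def unit_prefix(tag: str) -> str:
        # e.g. 'CD6:602TC457.PV' -> 'CD6'; otherwise fall back to the leading characters
        return tag.split(':')[0] if ':' in tag else tag[:2]

    def filter_to_unit(candidate_tags, target_tag):
        return [t for t in candidate_tags if unit_prefix(t) == unit_prefix(target_tag)]

    # filter_to_unit(all_plant_tags, 'CD6:602TC457.PV') would drop tags from other units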

Key Takeaways:

1. The pool of features considered for the feature selection strategy needs to be refined further in order to achieve a better set of features.

2. The SelectKBest strategy works better on the UNIT-level hierarchy, as the F-statistic helps in selecting feature tags with the same characteristics.

  • Also, the overall significance of the F-statistic helps in identifying model significance. Keeping such features is going to result in a statistically significant model.

3. We should continue with the univariate feature selection methodology of Correlation.

  • The feature selection strategy discussed above is helpful in the pre-modeling phase, in addition to the current correlation-based approach, which we recommend keeping as is.

4. Caveat: although the results were tested on 100 CV use cases, they are a reflection of the data used for training only. A different period of analysis may or may not change the results significantly.
